\(^1\) GenomEast platform, IGBMC

1 - Log in to GalaxEast

2 - History

2.1 - Create a new history

2.2 - Change the name of the history to “DNA-seq data analysis”

  • Change the name of the new history to “DNA-seq data analysis” by clicking on “Unnamed history” on top of the history panel. Then type “DNA-seq data analysis” and [ENTER].

3 - Importing files from your computer to Galaxy

  • Download the file “sample.bed.gz” following this link and upload it to Galaxy.
  • The genome is: Mouse (mm9)
  • The format is: bed

You should be in the history “DNA-seq data analysis” (Switch to it if needed)

  • Download the file “sample.bed” and upload it into GalaxEast in the history called “DNA-seq data analysis”
  • Click on “Upload Data”
  • Drag and drop the file sample.bed to the window that popped up.
  • The genome is : Mouse (mm9)
  • The format is : bed

4 - Remove a dataset

  • Remove the dataset sample.bed from your history by clicking on the button

  • You are told that your history is empty. Look at the size of your history
    • Click on “Show deleted” in the top of the history panel. Remove definitely the file from the disk by clicking on “Purge All Deleted Content”.
    • Click on “Show active

5 - Running a tool

  • Download the two files CRN-107_11-R1.fastq.gz and CRN-107_11-R2.fastq.gz following this link.
  • Import them to your history called “DNA-seq data analysis”
    • The genome is: Human (hg19)
    • The format: <auto detect>
  • Click on “Upload Data”
  • Drag and drop the two fastq files “CRN-107_11-R1.fastq.gz” and “CRN-107_11-R2.fastq.gz
  • Select/Enter Genome for both datasets as: hg19
  • Click on Start

  • Use the tool “FastQC Read Quality reports” to compute quality analysis on the datasets “CRN-107_11-R1.fastq” and “CRN-107_11-R2.fastq
    • Use default parameters.

One can run the same tool with the same set of parameters on several files at the same time.

What is the quality encoding of the two fastq files?

6 - Running tools without a workflow

Analyze CRN-107 data from reads to variant annotation.

Run the following tools:

  • BWA mem to align reads to the reference genome
  • Picard markduplicates to identify duplicated reads
  • Freebayes to detect variants
  • snpEff to annotate variants

To run the tools you will need the following files:

  • CRN-107_11-R1.fastq
  • CRN-107_11-R2.fastq
  • CaptureDesign_chr4.bed (download it from here)

Here are the parameters to use for each of the tools. All parameters not mentioned are to be used with default values.

  • Map with BWA-MEM - map medium and long reads (> 100 bp) against reference genome
    • Using reference genome: Human (Homo sapiens) (b37): hg19
    • Single or Paired-end reads: Paired
    • Select first set of reads: CRN-107_11-R1.fastq.gz
    • Select second set of reads: CRN-107_11-R2.fastq.gz
    • Set read groups information? Set read groups (Picard style)
      • Read group identifier (ID): Auto-assign Yes
      • Read group sample name (SM): Auto-assign Yes
      • Library name (LB): Auto-assign Yes
      • Platform/technology used to produce the reads (PL): ILLUMINA
      • Platform unit (PU): HS026.2
      • Sequencing center that produced the read (CN): Genomeast
      • Description (DS): CRN-107
      • Predicted median insert size (PI): 250
      • Date that run was produced (DT): 2017-12-13
  • MarkDuplicates examine aligned records in BAM datasets to locate duplicate molecules.
    • Select SAM/BAM dataset or dataset collection: output of BWA mem
    • Select validation stringency: Silent
  • FreeBayes bayesian genetic variant detector
    • BAM or CRAM dataset: output (bam) of markduplicates
    • Using reference genome: hg19
    • Limit analysis to regions in this BED dataset: CaptureDesign_chr4.bed
  • SnpEff Variant effect and annotation
    • Sequence changes (SNPs, MNPs, InDels): output of FreeBayes (VCF)
    • Input format: VCF
    • Output format: VCF (only if input is VCF)
    • Genome source: Downloaded on demand
      • Snpff Genome Version Name (e.g. GRCh38.86): hg19
  • VCFtoTab-delimited: Convert VCF data into TAB-delimited format
    • Select VCF dataset to convert: output of SnpEff
  • How many variants are called?

7 - Create a workflow out of an existing history

One can create a workflow from an existing history going to the history button and selecting “Extract Workflow”.

7.1 - Extract a workflow out of the history called “DNA-seq data analysis”

7.2 - Rename the workflow “DNA-seq data analysis”

8 - Edit a workflow with the workflow editor

8.1 - Open the workflow editor with the workflow “DNA-seq data analysis”

  • Click on Workflow (top menu)
  • Click on DNA-seq data analysis in the list of workflows
  • Select edit

An interactive view of the workflow is displayed:

You can move the boxes to make the workflow clearer:

8.2 - Add steps to the workflow

Your workflow should look like this before editing:

Add the following tools:

  • Samtools flagstat to compute mapping statistics (after BWA mem)
  • Filter SAM or BAM, output SAM or BAM to select aligned reads with a mapping quality >= 20 (after MarkDuplicates)
  • Samtools flagstat to compute mapping statistics after removing reads with low mapping qualities (after Filter)

Here are the parameters to use for each of the tools:

  • Flagstat tabulate descriptive stats for BAM dataset
    • BAM File to Convert: output of BWA mem
  • Filter SAM or BAM, output SAM or BAM files on FLAG MAPQ RG LN or by region
    • SAM or BAM file to filter: output of Picard MarkDuplicates
    • Minimum MAPQ quality score: 20
  • Flagstat tabulate descriptive stats for BAM dataset
    • BAM File to Convert: output of Filter

The final workflow should look like this (new tools are in purple boxes):

Save the workflow once you are done editing it:

9 - Run a workflow

9.1 - Import files

Import the following files from the data library “DNA-seq test datasets” to a new history:

  • CRN-107_11-R1.fastq
  • CRN-107_11-R2.fastq
  • CaptureDesign_chr4.bed

9.2 - Run the workflow DNA-seq data analysis

  • Choose the right files.
  • Check the parameters.

  • How many reads are discarded due to the low mapping quality?